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ABSTRACT 

This paper introduces the Bayesian Inference Engine (BIE), a general parallel-optimised soft- 
ware package for parameter inference and model selection. This package is motivated by the 
analysis needs of modern astronomical surveys and the need to organise and reuse expensive 
derived data. I describe key concepts that illustrate the power of Bayesian inference to address 
these needs and outline the computational challenge. The techniques presented are based on 
experience gained in modelling star-counts and stellar populations, analysing the morphology 
of galaxy images, and performing Bayesian investigations of semi-analytic models of galaxy 
formation. These inference problems require advanced Markov chain Monte Carlo (MCMC) 
algorithms that expedite sampling, mixing, and the analysis of the Bayesian posterior distribu- 
tion. The BIE was designed to be a collaborative platform for applying Bayesian methodology 
to astronomy. By providing a variety of statistical algorithms for all phases of the inference 
problem, a user may explore a variety of approaches with a single model implementation. 
Indeed, each of the separate scientific investigations above has benefited from the solutions 
posed for the other investigations, and I anticipate that the same solutions will be of general 
value for other areas of astronomical research. Finally, to protect one's computational invest- 
ment against loss any equipment failure and human error, the BIE includes a comprehensive 
persistence system that enables byte-level checkpointing and restoration. Additional technical 
details and download details are available from |http : //www, astro .umass . edu/bie[ 
The BIE is distributed under the GNU GPL. 

Key words: methods; data analysis - methods; numerical - methods; statistical - astronomical 
data bases; miscellaneous - virtual observatory tools 



1 INTRODUCTION 

Inference is fundamental to the scientific process. We may broadly 
identify two categories of inference problems: 1) estimation — 
finding the parameter of a theory or model from data; and 2) hy- 
pothesis testing — determining which theory, indeed if any, is sup- 
ported by the data. Astronomers increasingly rely on numerical 
data analysis, but most cannot take full advantage of the power 
afforded by present-day computational statistics for attacking the 
inference problem owing to a lack of tools. This is especially crit- 
ical when data comes from multiple instruments and surveys. The 
different data characteristics of each survey include varied selec- 
tion effects and inhomogeneous error models. Moreover, the infor- 
mation content of large survey databases can in principle determine 
models with many parameters but exhaustive exploration of param- 
eter space is often not feasible. 

These classes of estimation problems are readily posed by 
Bayesian inference, which determines model parameters, 6, while 
allowing for straightforward incorporation of heterogeneous selec- 
tion biases. In the Bayesian paradigm, current knowledge about the 
model parameters is expressed as a probability distribution called 
the prior distribution, it{0). This is the anticipated distribution of 
parameters for the postulated model before obtaining any measure- 
ments. This should include one's understanding of the model pa- 



rameters in their theoretical context. When new data D becomes 
available, the information content is expressed as P(D|0), the dis- 
tribution of the observed data given the model parameters. This will 
be familiar to some as the classical likelihood function, L(D|0). 
This information is then combined with the prior to produce an 
updated probability distribution called the posterior distribution, 
P{0\D). Bayes' Theorem defines this update mathematically: 



P{9\D) 
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(1) 



Many will recognise this as the multiplicative rule for conditional 
probability. Combined with the concept of sample spaces, measure 
theory, and Monte Carlo computation, Bayes theorem provides a 
rich framework for the quantitative investigation of a wide variety 
of inference problems, such as classification and cluster analyses, 
which broadly extends the two groups described above. Later sec- 
tions will illustrate the importance and utility of explicit quantifica- 
tion of the prior information. 

Why use the Bayesian framework? To begin, the Bayesian ap- 
proach unifies both aspects of the inference problem: estimation 
and hypothesis testing. For example, given a galaxy image and sev- 
eral families of brightness profiles, we would like to determine both 
the distribution of parameters for each family and which family is 
best supported by the data. A classical analysis would report the 



(c)2012RAS 



Maximum Li kelihood (ML ) estimate for each model using a x - 
type statistic jPearsonl 19001) and prefer the fit with the lowest value 
of x^ per degree of freedom. However, the x^ statistic will grow 
with sample size in the presence of measurement errors. This leads 
to the well-known over-fitting problem where the Pearson-type x^ 
test will reject the correct distribution in favour of one which better 
describes the deviations caused by the measurement errors. This 
well-known issue may be treated in a variety of ways, but the 
Bayesian approach naturally prefers the model with the smallest 
number of dimensions that can explain the data distribution through 
the specification of the prior information. The Bayesian approach 
further emphasises that model comparison problems must depend 
on the prior distribution. See i ]2.2l for more details. 

The computational complexity of applying equation (TJ di- 
rectly grows quickly with the number of model parameters and 
becomes intractable before the volume of currently available large 
data sets is reached. However, Monte Carlo algorithms based on 
Markov chains for drawing samples from the posterior distribu- 
tion promise t o make the Bayesian a pproach very widely appli- 
cable (e.g. see [Robert & Casellall2004l) . In turn, the application of 
Bayesian methods in cosmology and astrophysics has flourished 
over the pa st decade, sp urred by data sets of increasing size and 
complexity jTrottal2008n . Once a scientist can deteiTnine the poste- 
rior distribution, rigorous credible bounds on parameters and pow- 
erful probability-based methods for selecting between competing 
models and hypotheses immediately follow. This statistical ap- 
proach is superior to those commonly used in astronomy because it 
makes more efficient use of all the available infoiTnation and allows 
one to test astronomical hypotheses directly. To realise this promise 
for astronomical applications, we need a software system designed 
to handle both large data sets and large model spaces simultane- 
ously. 

Beginning in 2000, a multidisciplinary investigator team from 
the Departments of Astronomy and Computer Science at UMass 
designed and implemented the Bayesian Inference Enginqj, a 
Markov chain Monte Carlo (MCMC) parallel software platform 
for performing statistical inference over very large data sets. We 
focused on probability-based Bayesian statistical methods because 
they provide maximum flexibility in incoiporating and using all 
available information in a model-data comparison. For example, 
multiple data sources can be naturally combined and their selec- 
tion effects, which must be specified by the data provider to obtain 
a meaningful statistical inference, are easily incorporated. In this 
way, the BIE provides a platform for investigating inference using 
the virtual observatory paradigm. 

This paper has multiple goals. I begin by introducing in SJ2] 
the concepts in Bayesian inference that illustrate its power as a 
framework for parameter estimation and model selection for as- 
tronomical problems. This power, not surprisingly, comes with sig- 
nificant computational challenges that informs our design for the 
BIE; the features of the package are described in Appendix [A] I 
then describe and motivate our choice of MCMC algorithms (©; 
the BIE allows additional algorithms to be straightforwardly added 
as needed. This is followed by a brief summary of BIE-enabled 
research in ^ I summarise in ^ 



^ See |http: //www, astro. umass .edu/bie] for detailed descrip- 
tion and download instructions. 



2 WHAT DO ASTRONOMERS WANT AND NEED? 



2.1 Parameter estimation 

Many astronomical data analysis problems are posed as parameter 
estimates. For example: 1) one measures a spectral energy distribu- 
tion of an object and would like estimate its temperature; or 2) one 
measures the flux profile of a disk galaxy and would like to estimate 
its scale length. In these problems, one is asserting that the under- 
lying model is true and testing the hypothesis that the parameter, 
temperature or scale length, has a particular value. 

Bayesian inference approaches these problems with the fol- 
lowing three steps, reflecting the standard practice of the scientific 
method: 1) numerically quantify a prior belief in the hypothesis; 2) 
collect data that will either be consistent or inconsistent with the 
hypothesis; 3) compute the new confidence in the hypothesis given 
the new data. These steps may be repeated to achieve the desired 
degree of confidence. A clever observer will design campaigns that 
refine confidence efficiently (i.e., that makes the confidence high or 
low). In the context of our simple examples, one may believe that 
the spectral energy distribution is that of an M-dwarf star and one's 
prior belief is then a distribution of values centred on 2000 K. After 
measuring the spectral energy distribution, the prior distribution of 
temperature is combined with the probability of observing the data 
for a particular temperature, to get a refined distribution of the tem- 
perature of the object. Notice that this procedure does not result in a 
single value. Rather, the posterior probability distribution is used to 
estimate a credible interval (also known as a Bayesian confidence 
interval). Of course, credible intervals and regions are only a simple 
summary of the information contained in the posterior distribution. 
Unlike classical statistics, Bayesian inference does not rely on a 
significance evaluation based on theoretical or empirical reference 
distributions that are valid in the limit of very large data sets. Rather 
it specifies the probability distribution function for the parameters 
explicitly based on the data at hand. 

A prime motivation for the BIE project is the thesis that the 
power of expensive and large survey data sets is underutilised by 
targeting parameter estimation as the goal. To illustrate this, let 
us consider the second example above: estimating the scale length 
of a disk. A standard astronomical analysis might proceed as fol- 
lows. One determines the posterior probability distribution for scale 
lengths for some subset of survey images. Alongside scale length, 
one determines other parameters such as luminosity, axis ratios, or 
inclinations, and possibly higher moments such as the asymmetry. 
The scale length with maximum probability becomes the best es- 
timate and is subsequently correlated with some other parameter 
of interest, luminosity or asymmetry, say. Then, any correlation is 
interpreted in the context of theories of galaxy formation and evolu- 
tion. Observe, that in the first step, one is throwing out much of the 
information implicit in the posterior distribution. In particular, the 
luminosity estimate is most likely correlated with the scale-length 
estimate. If one were to plot the posterior distribution in these two 
parameters, one might find that the distribution is elongated in the 
scale-length-asymmetry plane, possibly in the same sense as the 
putative correlation! In other words, the confidence in the hypoth- 
esis of a correlation should include the full posterior distribution 
of parameter estimates, not just the maximum probability estimate. 
See i ]4.2l and Figure|3]for a real-world example. 

Moreover, this scenario suggests that one is using disk scale 
length and asymmetry as a proxy for testing a hypothesis about disk 
evolution or environment. These results might have been more re- 
liable if the observational campaign had been designed to enable a 
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hypothesis test, not a parameter estimate, from the beginning. This 
leads naturally to the following question. 



2.2 Which model or theory is correct? 

This question is a critical one for the scientific method. As- 
tronomers typically do not address it quantitatively but want to do 
so. I will separate the general question "which model is correct?" 
into two: 1) "does the model explain the data?", the goodness-of-fit 
problem; and 2) "which of two (or more) models better explains 
the data?", the model selection problem. Let us begin with (1) and 
discuss (2) in the next section. 

Suppose one has performed a parameter estimation and de- 
termined the parameter region(s) containing a large fraction of the 
probability. Before making any conclusions from the application of 
a statistical model to a data set, an investigator should assess the fit 
of the model to make sure that the model can explain adequately 
the important aspects of the data set. Model checking, or assessing 
the fit of a model, is a crucial part of any statistical analysis. Seri- 
ous misfit (failure of the model to explain important aspects of the 
data that are of practical interest) should result in the replacement 
or extension of the model. Even if a model has been assumed to be 
final, it is important to assess its fit to be aware of its limitations 
before making any inferences. 

The posterior predictive check ( PPC) is a commonl y-used 
Bayesian model evaluation method (e.g. lGelman et al.ll995l . Chap. 
6). It is simple and has a clear theoretical basis. To apply the 
method, one first defines a set of discrepancy measures. A dis- 
crepancy measure, like a classical test statistic, measures the dif- 
ference between an aspect of the observed data set and the theoret- 
ically predicted data set. Let M denote the model under considera- 
tion. Practically, a number of predicted data sets are generated from 
P{Y)\9* ,M) with 6* selected from the posterior distribution. Any 
systematic differences between the observed data set and the pre- 
dicted data sets indicate a potential failure of the model to explain 
the data. For example, one may use the distribution of a discrepancy 
measure based on synthetic data generated from the posterior dis- 
tribution to estimate a Bayesian p- value for the true data under the 
model hypothesis. The p- value in this context is simply the cumu- 
lative probability for the discrepancy statistic. A p-value in the tails 
of the predicted discrepancy-measure distribution suggests a poor 
fit to the data. By using a variety of different discrepancy statis- 
tics, one's understanding of how the model does not fit the data is 
improved. See i]3.5.1l for more detail. 

Another approach attempts to fit a non-parametric model to 
the data. If the non-parametric model better explains the data than 
the fiducial model, one rejects the fiducial model as a good fit. A 
procedure for assessing the model families will be described in the 
next section. A naive implementation of this idea is difficult, re- 
quiring a second high-dimensional MCMC simulation to infer the 
posterior distribution for the non-parametric model and a careful 
spec ification of the prior distributio n. A clever scheme for doing 
this JVerdinelli & Wassermanlll998h is described in i]3.5.2l 



2.3 Model selection and Bayes factors 

One often has doubts about our parametric models, even those that 
fit. This is especially true when the models are phenomenolog- 
ical rather than the results of first-principle theories. Therefore, 
one needs to estimate which competing model better represents 
the data. This approach has been applied to good advantage in in- 
terpreting the spatial fluctuations in the cosmic microwave back- 



ground radiation. The Bayesian model selection approach has been 
shown to efficiently answer questions such as: what is the best fit 
combinati ons of baryon ic, dark matter and dark energy compo- 
nents? See lTrottal ( 120081) for a review. 

Astronomers are becoming better versed in the more tradi- 
tional statistical rejection tests but astronomers often really want 
acceptance tests. Bayes factors provide this: one can straightfor- 
wardly evaluate the evidence in favour of the null hypothesis rather 
than only test evidence for rejecting it. Bayes factors are the dom- 
inant method for Bayesian model selectio n and are analogous 
to likelihood ratio te sts (e.g. |jeffrevslll96ll: lOelman et al1ll995l : 
iKass & Raftervll 19951) . Rather than using the posterior extremum, 
one marginalises over the parameter space to get the marginal prob- 
ability of the data under each model or hypothesis. The ratio of the 
likelihood functions marginalised over the prior distributions pro- 
vides evidence in favour of one model specification over another 
In this way, the Bayesian approach naturally includes, requires in 
fact, that one's prior knowledge of the model and its uncertainties 
be included in the inference. Although this dependence on the prior 
probability sometimes criticised as a flaw in the Bayesian approach, 
I believe this is as it should be: one's prior belief will invariably in- 
fluence one's interpretation of a statistical finding. The Bayesian 
framework allows the scientist to describe and incorporate prior 
beliefs quantitatively. In addition, the method demands that the sci- 
entist thoughtfully characterise prior assumptions to start; such dis- 
cipline will improve the quality of any scientific conclusions. 

Bayes factors are very flexible, allowing multiple hypotheses 
to be compared simultaneously or sequentially. The method selects 
between models based on the evidence from data without the need 
for nestings The posterior probability for competing models can be 
evaluated over an ensemble of data and used to decide whether or 
not a particular family of models should be preferred. Similarly, 
common parameters can be evaluated over a field of competing 
models with appropriate posterior model probabilities assigned to 
each. A tutorial illustrating this can be found in the BIE documen- 
tation. 

Mathematically, Bayes factors follow from applying Bayes 
Theorem to a space of models or hypotheses. Let P(M) be our 
prior belief in Model M, and let P{Ti\M) be the probability of 
observing D under the assumption of Model M. The Bayes Theo- 
rem tells us that the probability of Model M having the observed 
Dis 



P(M\T>) 



P{M)P{T>\M) 
P(D) 



(2) 



where -P(D) is some unknown normalisation constant. However, 
one may use equation (O to compute the relative probability of two 
competing models. Mi and Mf. 



P(.Mi|D) _ P{Mi) P{Ti\Mi) 
P{M2\D) ^ P{M2) P{T)\M2) 



(3) 



without reference to the unknown normalisation. The left-hand side 
of equation ^ may be interpreted as the posterior odds ratio of 
Model 1 to Model 2. Similarly, the first term on the right-hand side 
is the prior odds ratio. The second term on the right-hand side is 
called the Bayes factor Most often, one does not assert a preference 
for either model and assigns unity to the prior odds ratio. 

To define Bayes factors explicitly in terms of the posterior dis- 
tribution, suppose that one observes data D; these may comprise 

^ Two models are nested if they share the same parameters and one of them 
has at least one additional parameter. 
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Table 1. Jeffreys' table 
logi3i2 Bi2 Strength of evidence 



<0 


< 1 


Negative (supports A/2) 


to 1/2 


1 to 3.2 


Barely worth mentioning 


1/2 to 1 


3.2 to 10 


Positive 


lto2 


10 to 100 


Strong 


>2 


> 100 


Very strong 



many observations or multiple sets of observations. One wishes to 
test two competing models (or hypotheses) AIi and AI2, each de- 
scribed by its own set of parameters, 0i and 02- One would like 
to know which of the following likelihood specifications is better: 
Ml : Li(D|0i) or M2 : L2(D| 62), given the prior distributions 
7ri(0i) and 7r2(02) for di and 02- The Bayes Factor B12 is given 
by 

P(D|Mi) _ / 7ri(0iiMi)Pi(Dj6»i,A/i)d0i 



B12 



(4) 



P(D|Af2) / TT2{O2\M2)P2{'D\02, M2)d92 

If B12 > 1, the data indicate that Mi is more likely than A/2 and 
vice verse. Harold Jeffreys (1961, App. B) suggested the often-used 
scale for interpretation of B12 in half-unit steps in log B12 (see Ta- 
ble[TJ. This provides a simple-to-use, easily discussed criterion for 
the interpretation of Bayes factors. Note that classical hypothesis 
testing gives one hypothesis (or model) preferred status (the null 
hypothesis) and only considers evidence against it; the Bayes fac- 
tor approach is considerably more general. 

Given all of these advantages, why are Bayes factors not 
more commonly used? There are two main difficulties. First, mul- 
tidimensional integrals are difficult to compute. Following equa- 
tion l|4]l, one needs to evaluate an integral of the form: P(D) — 
f n{d)P{'D\6)dO. For a real world model, the dimensionality of 9 
is likely to be > 10. Such a quadrature is infeasible using standard 
techniques. On the other hand, a typical MCMC calculation has 
generated a large number of evaluations of the integrand at consid- 
erable expense. Can one use the posterior sample to evaluate the 
inte gral? 

Raftervl ( Il995h suggests a Laplace-Metropolis estimator that 



uses the MCMC posterior simulation to approximate the marginal 
density of the data using Laplace's approximation (see Raftery op. 
cit. for details). In practice, this is only accurate for nearly Gaussian 
(or normal, eg. IB lb unimoda l poste rior distributions. As part of the 
BIE development, IWeinbergI ( 120121) described two new approaches 
for evaluating the marginal likelihood from the MCMC-generated 
posterior sample and both of these are implemented in the BIE as 
secondary analysis routines (see i]3.2t . In short, I believe that the 
BIE together with recent advances for computing the marginal like- 
lihood makes the wholesale computation of Bayes factors feasible 
in many cases of interest. 

A second well-known difficulty is the sensitivity of Bayes fac- 
tors to the choice of prior. Most commonly, researchers feel that 
vague priors are more appropriate than informative priors. This 
leads to an inc onsistency known as the Jeffreys-Lindley paradox 
JLindlevI 19571) . which shows that vague priors result in overwhelm- 
ing odds for or against a hypothesis by varying the parameter that 
controls the vagueness (e.g. extending the range of an arbitrary uni- 
form distribution). This apparent problem has led researchers to 
seek Bayesian hypothesis tests that are less sensitive to their prior 
distributions. Conversely, I argue that one should not expect a vague 
prior to yield a sensible model comparison. Rather, the prior should 
be used to express prior belief in a theory and, therefore, the result- 
ing hypothesis test should be sensitive to the prior. This sensitivity 



implies that the theory implicit in the model is informed by one's 
background knowledge. Nonetheless, the prior knowledge is diffi- 
cult to quantify and I would still advocate testing a variety of prior 
distributions consistent with one's prior knowledge. This may be 
tested through direct sensitivity analyses, such as resimulation with 
chains at different resolutions and approximate priors. 

Regardless of one's viewpoint, I believe that the BIE project 
cuiTently provides a useful platform for investigating the use of 
Bayesian model comparison and hypothesis testing and hope that 
it will help pave the way for new applications. In some cases, com- 
puting the Bayes factor will be infeasible. For these, the BIE in- 
cludes an MCMC algorithm that selects between models as part of 
the posterior simulation (reversible jump) as described in i]3.3l 



2.4 Observational requirements 

The probability of the data given the parameter vector and the 
model, P{D\9,M) or the likelihood function, is fundamental to 
any inference, Bayesian or otherwise. Meaningful inferences de- 
mand that the data presentation include all of the information nec- 
essary for the modeller to compute P{D\6, M) accurately and pre- 
cisely. The more direct the construction of P{D\9,M) from the 
physical theory, i.e. the less information lost in modeling the ac- 
quired observations, the easier it is to calculate P{D\6, M), lead- 
ing to a quality result. In other words, the more the data is reduced 
through summary statistics and "cleaned" by applying complex 
filters, the less information remains and the greater the effect of 
difficult-to-model correlations. This is somewhat contrary to stan- 
dard practice. 

In addition, astronomers often quote their error models in the 
form of uncoiTelated standard errors. The customary expectation 
is that each datum, typically a data bin or pixel, should be within 
the range specified by the error bar most of the time. Quoted error 
bars are often inflated to make this condition obtain. This leads to 
a number of fundamental flaws that makes the error model (and 
therefore the data) unsuitable for Bayesian inference: 

(i) Binned and pixelated data are nearly always correlated. For 
example, a flat-field photometric correction correlates the pixels of 
an image over its entire scale. Sky brightness removal has similar 
effects. There are many additional sources of indirect correlations. 
Parameter estimations are often sensitive to these correlated excur- 
sions in the data values and ignoring these correlations will lead to 
erroneous inferences. Data archivists can facilitate accurate infer- 
ences by providing correlation matrices for all error models. 

(ii) Selection effects must be modeled in the likelihood function 
and, therefore, these effects must be well specified by the archivist 
to facilitate straightforward computation. For example, consider a 
multiband flux-limited source catalogue. A colour-magnitude or 
Hess diagram in two flux bands will have a non-rectangular bound- 
ary owing to the flux limit. Although this is a simple example, se- 
lection effects may be terribly difficult to model; consider spatial 
variations in source completeness owing to the diffraction spikes 
from bright stars. 

(iii) Astronomers tend to use historically familiar summary data 
representations that inadvertently complicate the computation of 
P{D\0,M). Continuing the previous example, the magnitude- 
magnitude diagram contains the same information as the colour- 
magnitude diagram but the selection effects lie along flux-level 
boundaries. For a more complicated example, consider the Tully- 
Fisher diagram. The input data set may contain flux limits, mor- 
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phology selections, image inclination cuts, redshift range limits, 
just to name a few. 

In summary, data processing and reduction, con^elates the data 
representation and complicates the computation of P{Yi\6,JV[). 
This renders the modeling process difficult and potentially unreli- 
able. Rather, one should endeavour to separate each source of error, 
carefully specifying the underlying acquisition process for every 
observational campaign. For example, each pixel datum in a digital 
image may be characterised by the observed data number (dn), gain 
and bias, read noise, thermal background fraction, etc. Together, 
their combination yields an error model. Even if a first-principle 
process cannot be described, an empirical distribution or process 
description will be helpful to modelers. For example, a distribu- 
tion of deviations for measured pixel values relative to a reference 
calibration field may be used as an error model. Ideally, the data 
representation should be as close to the acquired form as possible. 
In cases where the archiving or presentation of source data is im- 
practical, the production of a correlation matrix is essenti^; 

The effect of data correlation has been explored by ILu et al.l 

( 1201 ih . They describe the parameter inference for a semi-analytic 
model of galaxy formation conditioned on a galaxy mass function 
with both correlated and uncoiTelated data bins. The differences 
in the posterior distributions for these two cases is dramatic. When 
the error model is in doubt, the sensitivity of the inference to the er- 
ror model can be investigated in the Bayesian paradigm by putting 
prior distributions on parameters of the en^or models that describe 
their uncertainty, marginalising over those hyperparameters, and 
comparing with the original posterior. Although this is more ex- 
pensive and rarely done, one should consider performing such sen- 
sitivity analyses regularly. 



3 SOLUTIONS PROVIDED BY THE BIE 

Assume that one has identified the observational selection effects, 
specified the prior distributions, and constructed P{D\0,M). How 
does one compute and the posterior distribution defined by equation 
(0 using MCMC? 



3.1 Computing the posterior distribution 

This section presents four MCMC sampling algorithms of increas- 
ing complexity included in the BIE. I begin with a description and 
some motivation for the standard Metropolis-Hastings algorithm. 
This simple easy-to-use and implement algorithm often fails to 
converge and is difficult to tune for complicated high-dimension 
distributions. The next three sections introduce modifications that 
circumvent these pitfalls. 



3.1.1 The Metropolis-Hastings algorithm 

Metropolis-Hastings is the most well-kn own of MCMC algorithms 
JMetroDolis et al.lll953KlHastingslll97(ll) . This algorithm constructs 
a Markov chain that generates states from the P{0\D, JVi) distri- 
bution after a sufficient number of iterations. Its success requires 
that the Markov chain satisfy both an ergodicity and a detailed bal- 
ance condition. The ergodicity condition ensures that at most one 
asymptotic distribution exists. Ergodicity would fail, for example, 
if the chain could cycle back to its original state after a finite num- 
ber iterations. Ergodicity also requires that all states with positive 
probability be visited infinitely often in infinite time {cAleA positive 



recurrence). For general continuous state spaces, ergodicity is read- 
ily achieved. The detailed balance condition ensures that the chain 
admits at least one asymptotic distribution. Just as in kinetic the- 
ory or radiative transfer, one defines a transition process that takes 
an initial state to a final state with some probability. If the Markov 
chain samples the desired target distribution P{6) — P{6\'D,A4), 
detailed balance demands that the rate of transitions — >■ 0' is the 
same as the rate of transitions from 6' — > 6. 

One may state this algorithm explicitly as follows. Let P{0) 
be the desired distribution to be sampled and q{6, 9') be a known, 
easy-to-compute transition probability between two states. Given 
6, the distribution q{0, 0') is a probability distribution for 0' . Let 
a{9, 6') be the probability of accepting state 6' given the current 
state G. In short, one can show that if the detailed balance condition 

P{9)q{e, 6')a{0, 6') = P{0')q{0\ e)a{0\ 6) (5) 

holds, then the Markov chain will sample P{6). It is straightfor- 
ward to verify by substitution that acceptance probability 

a(0, e') = min {1, [P{0')q{e' , e)/P{0)q{e, e')]] (6) 

solves this equation (for additional discussion see |Liu|2004) . Equa- 
tion ([5j has the same form as well-known kinetic rate equations as 
follows. Given the probability of a transition over some time inter- 
val from a state A to some other state B of a physical system and 
the corresponding reverse reaction, then the equilibrium condition 
for A'^ = Na + Nb systems distributed in the two states is: 

NaP{A -> B) = NbP{B -^ A). 

In equation Jsj' the probability densities P{6) and P{0') play the 
role of the occupation numbers Na and Nb and the product of the 
transition and acceptance probabilities play the role of the proba- 
bilities P{A -s- B) and P{B -)■ A). 

The transition probability q(6, 6') is often chosen to facilitate 
the generation of 0' from 0. [Metropolis et al.l l ll953h introduced 
a kernel-like transition probability q{0,d') = q{d — 0') where 
q[-) is a density. This has the easy-to-use property of generating 
6' — + ^ where ^ ~ (?. (Here and throughout, the notation 
6 ~ P{6) expresses that the distribution function of the variate 9 is 
P(0).) Further, if g is symmetric, i.e. g(z) = g(— z), then equation 
^ takes the simple form 



a{6, e') = min <^ 1 



PJO') 
P{0) 



(7) 



The BIE provides two symmetric distributions for q: a multivari- 
ate normal and a unifomi or top-hat distribution. Each element of 
6 is scaled by a supplied vector of widths, w. The choice of w is 
critical to the performance of the algorithm. If the width elements 
are too large, P{0')/P{9) will tend to be very small and proposed 
states will rarely be accepted. Conversely, if the width elements are 
too small, the new state frequently will be accepted and successive 
states will be strongly correlated. Either extreme leads to process 
that is slow to reach equilibrium. The optimal choice is somewhere 
in between the two. As the dimensionality of the parameter space 
grows, specifying the optimal vector of widths a priori is quite dif- 
ficult. I will address this difficulty in i ]3.I.4| 



3.1.2 Tempered transitions 

In addition to the inherent difficulties associated with tuning the 
transition probability, the Metropolis-Hastings state can easily be 
trapped in isolated modes, between which the Markov chain moves 



© 2012 RAS, MNRAS OQO.fTlflTl 



only rarely. This prevents the system from achieving detailed bal- 
ance and, thereby, prevents sampling from the desired target dis- 
tribution P{6). There are a number of techniques for mitigating 
this so-called mixing problem. For the BIE, I adopte d a synthesis 
of Metropolis-coupled Markov chains (JGeyer 19911) and a sim- 
ulated tempering method proposed by iNeal ( 1996) called tem- 
pered transitions. To sample from a distribution Po{0) = P{0) 
with isolated modes, one defines a series of n other distributions, 
Pi{0), . . . , Pn{0), with Pk being easier to sample than Pk-i- For 
example, one may choose 



P,(0)cxP,^(0) 



(8) 



with 1 = ^0 > /3i > • • • > Pn-i > /?„ > 0. This construc- 
tion has a natural thermodynamic interpretation. One may write 



/3fclog(Po) 



„log(Po)/Tfc 



The distribution with tem- 



PS''{e) = e 

perature To = T = 1 is the original distribution. Hotter distri- 
butions have higher temperature T^ > To and are over-dispersed 
compared with the original cold distribution. In other words, tak- 
ing a distribution function to a small fractional power decreases the 
dynamic range of its extrema. In the limit T^ — >■ oo, P^ becomes 
uniform. Next, the method defines a pair of base transitions for each 
k, Tk and Tk, which both have Pk as an invariant distribution and 
satisfy the following mutual reversibility condition for all G and 0' : 

Pkie)fkie,e') = tk{0' ,e)Pkie'). 

A tempered transition first finds a candidate state by apply- 
ing the base transitions in the sequence T\- ■ -Tn. After each up- 
ward transition, new states are sampled from a broader distribu- 
tion. In most cases, this liberates the candidate state from confine- 
ment by the mode of the initial state. This is then followed by a 
series of downward transitions Tn ■ ■ -Ti- This candidate state is 
then accepted or rejected based on ratios of probabilities involving 
intermediate states. Thermodynamically, each level k corresponds 
to an equilibrium distribution at temperature Tk. Therefore, the up- 
ward or downward transitions correspond to heating or cooling the 
system, respectively. One chooses the maximum temperature to be 
sufficiently high to melt any structure in the original cold posterior 
distribution that would inhibit mixing. 

Explicitly, the algorithm proceeds as follows. The chain be- 
gins in state Go and obtains the candidate state, 6o, as follows. For 
j = Oton — 1, generate 0j+i from 6j using Tj+i. Set 6n = On. 
Then, for j = n to 1, generate 0j- 
date state. Go, is then accepted with probability 



PoiOo) 



P„(0„_l) P„_l(0„-l) Po(0o) 



l(0„_l) P„(0n-l) 



Pi(0o) 



(9) 



Although equation ^ looks complicated, the derivation in iNeall 
( 119961) shows that it immediately follows by recursively applying 
equations ^ and l[6}. Then, owing to the mutual reversibility con- 
dition, the T and T dependence in equation l|9) cancels. If the can- 
didate state is not accepted, the next state of the Markov chain is 
the same as the original state. Go. In practice, I 'bum-in' the chain 
at each level j for M iterations, with Af = 20 typically. Note that 
each Pi occurs an equal number of times in the numerator and de- 
nominator in equation (|9). Therefore, the acceptance probability 
can be computed without knowledge of the normalisation constants 
for these distributions. If the acceptance probability is to be reason- 
ably high, properly-spaced intermediate distributions will have to 
be provided that gradually interpolate from Pq to P„. Thermody- 
namically, this con'esponds to an adiabatic increase of the heat-bath 
temperature followed by an adiabatic return to the original tem- 
perature. The maximum temperature T„ — I3~^ should be large 
enough to break any barriers between modes at low temperature 



but not so large that P„(0) provides no constraint on the distribu- 
tion of G. Since this value depends on the features of P{6) that are 
not known a priori, I choose T„ by checking the distribution of 6 
for Pn with various values of r„. Our tests have shown that this al- 
gorithm works remarkably well for a variety of different inference 
problems. 

Ma ny readers will be fami liar with the idea of simulated an- 
nealing l IKirkpatrick et al.lll983n . Each step of a simulated anneal- 
ing algorithm proposes to replace the current state by a state ran- 
domly constructed from the current state by a transition probability. 
The magnitude of the excursion allowed by the transition probabil- 
ity is controlled by a temperature-like parameter that is gradually 
decreased as the simulation proceeds. This prevents the state from 
being stuck at local maxima. The tempered transitions algorithm 
is heuristically similar to simulated annealing with the important 
additional property of obeying detailed balance. 

3.1.3 Parallel tempering 

The parallel tempering algorithm inverts the order of the previous 
algorithm: it simultaneously simulates n chains, each with target 
distribution Pj (eq. [8]l and proposes to swap states between ad- 
jacent members of the sequence at predefined intervals. The high 
temperature chains are generally able to sample large volumes of 
parameter space, whereas low temperature chains, may become 
trapped in local probability maxima. Parallel tempering achieves 
good sampling by allowing the systems at different temperatures to 
exchange states at very different locations in parameter space. The 
higher temperature chains often achieve detailed balance quickly 
and accelerate the convergence of the lower temperature chains. 
Thus, this method may allow a simulation to achieve detailed bal- 
ance even in the presence of widely separated modes. In some sit- 
uations, parallel tempering outperforms tempered transitions with 
a lower overall computational cost. In addition, the parallel tem- 
pering algorithm is trivially parallelised by assigning each chain 
its own process. The tempered transitions algorithm is intrinsically 
serial and parallel efficiency is only obtained if the likelihood com- 
putation is parallelisable. 

The parallel tempering algorithm proceeds as follows. At each 
step, a pair of adjacent simulations in the series is chosen at random 
and a proposal is made to swap their parameter states, whose accep- 
tance is determined using a Metropolis-Hastings criterion. Let the 



j iterate of the state in the k 
is accepted with probability 



chain be denoted as ' . The swap 



a = min 



][fe+il 



Pk{G';^''\ii,M)Pk+i{0r\^,M) 



[fell 



,W, 



i[fe+ili 



(10) 



Pk{Gp\r>,M)Pk+i{Sr^'\^'^l 
where Pk{6\T),A4) is the posterior probability of G given the data 
D for chain k and model assumptions A4. Final results are based 
on samples from the /3o = 1 chain. As in the tempered transi- 
tions algorithm, the high-temperature states will mix between sepa- 
rated modes more efficiently, and subsequent swapping with lower- 
temperature chains will promote their mixing. 

The analog thermodynamic system here is an array of sys- 
tems with the same internal dynamics at different temperatures. 
At higher temperatures, strongly 'forbidden' states are likely to re- 
main forbidden but valleys between multiple modes are likely to be 
more easily crossed. In contrast to tempered transitions (i j3.1.2| l, the 
proposed transitions in parallel tempering are sudden exchanges of 
state between systems of possibly greatly different temperature. As 
with tempered transitions, the algorithm obeys the detailed balance 
equation. 
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3.1.4 Differential evolution 

Real-world high-dimensional likelihood functions often have com- 
plex topologies with strong anisotropics about their maxima (see 
i]4.1| Fig.[2j. Difficulties in tuning the Metropolis-Hastings transi- 
tion probability to achieve both a good acceptance rate and good 
mixing plagues high-dimensional MCMC simulations of the pos- 
terior probability. This pr oblem affects all o f algorithms discussed 
up to this point. Recently iTerBraakI feOOd) introduced an M CMC 
variant of a genetic algorithm called differential evolution jPricd 
ll997l : IStom & Pricelll997l : IStomll999h . This version of differential 
evolution uses an ensemble of chains, run in parallel, to adaptively 
compute the Metropolis-Hastings transition probability. In all that 
follows, I will refer to the MCMC variant as simply differential 
evolution. 

The algorithm proceeds as follows. Assume that our ensem- 
ble has n chains to start, e.g., initialised from the prior probability 
distribution. Each chain has the same targ et distributio n P{0). The 
original differential evolution algorithm JPricdl 19971) proposes to 
update member i as follows: Op = 0^j, + 7(0^j^ — 6^2) where 
RQ, Rl, R2 are randomly selected without replacement from the 
set {1,2, ... ,n}. The proposal vector replaces the chosen one 
if P(0jj'') > Pfei^'l. Ilbr BraakI ( l200d) shows that with minor 
modifications the transition probability and the acceptance con- 
dition for differential evolution obeys detailed balance. The new 
MCMC version of the differential evoluation algorithm takes the 
fonaOp — S'"* +7(0^j —6^2) + ^ where eis drawn fromasym- 
metric distribution with a small variance compared to that of the 
target, but with unbounded support such as a d dimensional normal 
distribution (eq. lBlt with very small variance a^: e ~ N'^{Q, a^). 
The random variate e is demanded by the recurrence condition: the 
domain for non-zero values of the posterior P must be reached in- 
finitely often for an infinite length chain. The proposal Gp is ac- 
cepted as the next state for Chain i with probability 



a = mm 



P{e,) 



pm 



In essence, differential evolution uses the variance between a pop- 
ulation of chains whose distributions are converging to the target 
distribution to automatically tune the proposal widths. Although 
the transition probability distribution q(0,9') does not have an an- 
alytic form in this application, the differential evolution algorithm 
enforces symmetry through the random choice of indices, and the 
distribution q{d, 0') clearly exists. 

Thi s algorithm as sta ted above does not address the mixing 
problem. iTer BraakI ( 120061) suggests including simulated tempering 
or simulated annealing in differential evolution. Along these lines, 
BIE also includes a hybridised differential evolution which periodi- 
cally performs a tempered transition step for all n chains in parallel. 
This provides an ensemble at each temperature for the upward and 
downward transitions. As in tempered transitions, we evolve each 
chain for Af steps at each temperature level. A typical number of 
temperature levels is twenty, and therefore, the addition of temper- 
ing may slow the algorithm by an order of magnitude. Although 
tempered differential evolution is dramatically slower than the sim- 
ple differential evolution, I have found it essential for achieving a 
converged posterior sample for many of our real- world astronom- 
ical inference problems. This met hod was applied to the Bayesian 
semi-analytic models described in 



3.1.5 Summary: choice of a MCMC algorithm 

I advocate performing a suite of preliminary simulations to explore 
the features of one's posterior distribution with various algorithms. 
My experience suggests that there is no single best MCMC algo- 
rithm for all applications. Rather, each choice represents a set of 
trade offs: more elaborate algorithms with multiple chains, aug- 
mented spaces, etc., are more expensive to run but may be the only 
solution for a complex posterior distribution. Conversely, an elab- 
orate algorithm would be wasteful for simulating a simple poste- 
rior distribution. Moreover, combinations of MCMC algorithms in 
multiple-chain schemes are often useful. 

For distributions with complex topologies, differential evolu- 
tion relieves the scientist of the task of hand selecting a transition 
probability by trial and error. This method has the advantage of 
added efficiency: states from all chains in a converged simulation 
provide valid posterior samples. However, this strategy may back- 
fire if the posterior is strongly multimodal because differential evo- 
lution requires multiple chains in each mode to enable mixing be- 
tween modes. Any single chain in a discrete mode will remain for- 
ever. Parallel chains and similar algorithms do not have this prob- 
lem. Although high-temperature chains from tempered methods do 
not sample the posterior distributi on, they do provid e useful in- 
formation for importance sampling. lYoon et al.l(l2012r) has produc- 
tively used high-temperature samples for Monte Carlo integration. 



3.2 Computation of Bayes factors and marginal lil^elihoods 



As described in i]2.3l the marginal likelihood plays a key role 
in Bayesian model selection. There are several common strate- 
gies for computing the marginal likelihood. The simplest is di- 
rect quadrature usin g multidimensional cubature algorithms (e.g. 
iBemtsen et al.ll99ir) . Computational complexity limits its applica- 
tion to approximately four or fewer dimensions. Secondly, one may 
approximate the integrand around each well-separated mode as a 
multivariate normal distribution and integrate the resulting approx- 
imation analytically. This is the Laplace approximation. It suits 
simple unimodal densities approximately Gaussian shape, but the 
posterior distributions for many real-world problems are far from 
Gaussian. Finally, one may rewrite Bayes theorem as an expres- 
sion that evaluates the normalisation constant from the posterior 
sample as the harmonic mean of the likelihood function, as will be 
shown below. In short, none of the three suffices in general: direct 
quadrature is most often computationally infeasible, the Laplace 
approximation works well only for simple posterior distributions 
and the harmonic mean approximation often has enormous vari- 
ance owing to its inv erse weighting by the like lihood value (see 
iKass&RaftervIl 19951) . To help address this lack. I Weinberg ( l2012h 
presents two computationally-modest families of quadrature algo- 
rithms that use the full sample posterior but without the instabil- 
ity of the harmonic mean approximation ( INewton & Raftervl 19941) 
or the specificity of the Laplace approximation JLewis & Raftervl 
Il997h . 

The first algorithm begins with the normalised Bayes theorem: 



z X P{e\r>) = n{e)L{'D\e) 



(11) 



where 



nod was applied I 
lLuetal.l ( l20TTI) . 



Z= [ d0P{e\T)) = / d07r(0)L(D]0) (12) 

in Jn 

normalises P(0|D) (as in eq.[Tll. The quantity Z is called the nor- 
malisation constant or marginal likelihood depending on the con- 
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text. Dividing by L(D|0) and integrating over 9 we have 

P(0|D) 



Z X 



dO 



L(D,.)-//«-(^)- 



(13) 



Since the Markov-chain samples the posterior, P{6\D), the com- 
putation of the integral on the left from the chain appears as an 
inverse weighting with respect to the likelihood. This is poorly con- 
ditioned owing to the inevitable small values of L{Y>\6). However, 
if the integrals in equation ( |13t are dominated by the domain sam- 
pled by the chain, the integrals can be approximated by quadrature 
over a truncated domain, Sis that eliminates the small number of 
the chain states with low L(D|0). More precisely, the integral on 
the left-hand-side may be cast in the following form: 



/ 

Jn 



p[e\D) 



n/^'i(D|0) 



dYM{Y). 



(14) 



This integral will be a good approximation to the original if the 
measure function defined by 



M{Y) 



i/L(D|e)>y 



dep{e\D) 



(15) 



decreases faster than L as L — > 0. Otherwise, the integral in equa- 
tion l lI4l l does not exist and the first algorithm cannot be used; see 
[Weinberg (2012i) for details. Intuitively, one may interpret this con- 
struction as follows: divide up the parameter space £ Q,s into 
volume elements sufficiently small that P{0\D) is approximately 
constant within each volume element. Then, sort these volume el- 
ements by their value of K(D|0) = L^^{T)\9). The probability 
element dM = M{Y + dY) - M{Y) is the prior probability of the 
volume between Y and Y + dY . However, if the truncated volume 
forms the bulk of the contribution to equation ( 114) . the evaluation 
will be inaccurate. 

To evaluate the r.h.s. of equation ( 113) , one may use the sam- 
pled posterior distribution itself to tessellate the sampled volume 
in Q,s C SI. This may be done straightforwardly using a space- 
partitioning structure. A binary space partition (BSP) tree, which 
divides a region of parameter space into two exclusive sub regions 
at each node, is particularly efficient. The most easily implemented 
tree of this type for arbitrary dimension is the kd-tree (short for 
k-dimensional tree). The kd-tree algorithms split R* on planes per- 
pendicular to one of the coordinate system axes. The implementa- 
tion provided for the BIE uses the median value along one of axes 
(a balanced kd-tree). I have also implemented a hyper-octree. The 
hyper-octree generalises the octree by splitting each n-dimensional 
parent node into 2" hypercubic children. Unlike the kd-tree, the 
hyper-octree does not split on point location and the size of the 
cells is not strictly coupled to the number of points in the sample. 
In addition, the cells in the kd-tree might have extreme axis ratios 
but the cells in the hyper-octree are hypercubic. This helps provide 
a better represen tation of the volume containing sample points. See 
IWeinberd ( 120120 for additional details, tests, and discussion. Ap- 
proximate tessellations also may be useful. For example, the near- 
est neighbour to every point in a sample could be used to circum- 
scribe each point by a sphere of maximum volume such that all 
spheres are non-overlapping. Compariso ns and performanc e details 
will be reported in a future contribution (lYoon et al.ll2012D . 

For cases where the integral in ( 1141 does not exist or the first 
algorithm provides is a poor approximation, Z may be evaluated 
directly using the second computational approach. Begin by inte- 
grating equation dlU over Sl^ C SI: 



Z X / dep{e\D) 



The Monte Carlo evaluation of the integral on the left-hand side 
is simply the fraction of sampled states in Q,s relative to the entire 
sample: Fq^ = X^eeo VX^eenl- ^^^ integral on the right- 
hand side may be evaluated using the space-partitioning procedure 
described above. Altogether, then, one has 



Fa 



d8n{e)L(D\e) 



(17) 



where Sis is ideally chosen to avoid regions of very low posterior 
probability. This method has no problems of existence for proper 
probability densities. 

There are several sources of error in this space partition. For 
a finite sample, the variance in the tessellated parameter-space vol- 
ume will increase with increasing volume and decreasing posterior 
probability. This variance may be estimated by bootstrap. As usual, 
there is a variance-bias trade-off in choosing the resolution of the 
tiling: the bias of the probability value estimate increases and the 
variance decreases as the number of sample points per volume ele- 
ment increases. Some practical examples suggest that the resulting 
estimates are not strongly sensitive to the number of points per cell. 

In summary, the choice between the various algorithms de- 
pends on the problem at hand. The Laplace approximation may 
be a good choice for posterior distributions that are unimodal with 
light tails but this is often not the case for real- world prob l ems. I 
investigate the performance of the al gorithms in IWeinberd (120121) 



Yoon et al.l ( l2012h . To date, I 



den{e)L{D\e). 



(16) 



for high-dimensional distributions in[ 
have reliably evaluated Z for n < 14 using a MCMC-generated 
samples of approximately 10^ points with auto-con^elation lengths 
of approximately 20. Additional perform ance details and t ests in 
astronomical applications are described in I Yoon et al.l(l2012n . 



3.3 Dimension switching algorithms 

When generating large sample sizes that are necessary for an accu- 
rate computation of the marginal likelihood is impractical, one may 
propose a number of different models and choose between them as 
part of the Monte Carlo sampling process. This is done by adding a 
discrete indicator variable to the state to designate the active model. 
The resulting state space consists of a discrete range for the indi- 
cator a nd of continu ous ranges for each of the parameters in each 
model. lGreenl ( ll995l) showed that the detailed balance equation can 
be formulated in such a general state space. This allows one to pro- 
pose models of different dimensionality and thereby incorporate 
model selection into the probabilistic simulation itself. The algo- 
rithm require s a transition probability to and from each sub space 
dGreenll 19951) . 

For example, suppose one has an image of a galaxy field that 
one would like to model with some unknown number of distinct 
galaxies k £ [l,n]. The extended sample space, then, consists of 
n subspaces, each one of which contains the parameter vectors for 
each of the k galaxies for each subspace. Suppose that the current 
state is in the k — 3 subspace; that is, the current model has three 
galaxies. With some predefined probability at each step, p(3,4), 
the algorithm proposes a transition to fc = 4 galaxies by splitting 
one of three galaxies into two separate but possibly blended com- 
ponents. Similarly, with some predefined probability at each step, 
p(4, 3), one defines the reverse transition by combining two adja- 
cent components into one component. Finally, no subspace transi- 
tion is proposed with the probability p(3, 3) = 1— p(3,4)— p(4,3). 
For model comparison, an estimate of the marginal probability for 
each model k follows directly from the occupation frequency in 
each subspace of the extended state space. 



© 2012 RAS, MNRAS OOO.fTlfTTl 



Bayesian Inference Engine 9 



To make this explicit following iGreenI ( Il995l) . one first de- 
fines reversible transitions between models in different subspaces, 
say i and j. This is accomplished by proposing a bijective 
function gij that transforms the parameters between subspaces 
gij{9^'',(j)^'') = {9'-^',4>"'), and enforces the dimensional match- 
ing condition d{e^'^) + d((j)^'^) = d{9^^^) + d((j}^^^) where the op- 
erator d(-) returns the rank of the vector argument. The param- 
eter vector 0'' is a random quantity used in proposing changes 
in the components and for choosing additional components when 
going to a higher dimension. The rank of <^'' may be zero. For 
example, if d{6^'') — 2 and d{6^-'') — 1 then one may define 
d(4> ) ~ and d{(f>^-'') = 1. In other words, for the purposes of 
inter-dimensional transitions, each subspace is augmented by ran- 
dom variates with the constraint that the dimensionality of the aug- 
mented spaces match. 

Then, if qij {9'-'' , (j>^'') is the probability density for the pro- 
posed transition and p{i,j) is the probability to move from sub- 
space i to subspace j, the acceptance probability may be written 



aij{9 



[i] nyu 



(18) 



ui^, 



fi(ebi|D)p(j,z)g,.(eM,,^ 
' P,{9l^]\D)p{i,j)qi,{9M,<pM) 



9(6iM,(^W) 



d{9M,(j)M) 



where the final term in the second argument of min{} is the 
Jacobian of the mapping between the augmented spaces, and 
Pj {9^-'' |D) is the posterior probability density for the model in sub- 
space j. The probability densities p{i, j) and g^ are selected based 
on prior knowledge of the problem and to optimise the overall rate 
of convergence. The algorithm can be summarised as follows. As- 
sume that the state at iteration i is in subspace rii with parameter 



vector 



,[">] 



One proposes a new state as follows: 



(i) Choose a new model j by drawing it from distribution 
p{ni, •). Propose a value for the parameter 9^-'' by sampling i^'"'' 
from the distribution q„j (6l!"*' , 0'"*' ) . 

(ii) Accept the move with probability a„;j (&'"'', S'^'). 

(iii) If the move is accepted, let ti^+i — j and 9^^^^^ — 9'-^'. 

(iv) If the move is not accepted, stay in the cuixent subspace : 
Ui+i = Ui and 6l._|_j+ = 9\ "-' . 



lOreenl ( Il995h named this algorithm Reversible Jump Markov chain 
Monte Carlo (RJMCMC). One often chooses to interleave RJM- 
CMC steps with some number of standard Metropolis-Hastings 
steps to improve the mixing in each subspace. 

Within the constraints, the transitions are only limited by one's 
inventiveness. However, I have found that the most successful tran- 
sitions are intuitively motivated by features of the scientific prob- 
lem. Although RJMCMC enables model selection between arbi- 
trary families of models with various dimensionality, the conver- 
gence of the MCMC simulation depends on the specification of the 
transition probabilities and a careful tuning of the MCMC algo- 
rithm. In addition, I have not tried RJMCMC with multiple-chain 
algorithms that propose transitions between chains. This appears to 
be formally sound but the requirement that the transitioning chains 
be in the same subspace may yield very long mixing times. Sim- 
ilarly, the differential evolution algorithm (see i]3.1.4| ( would re- 
quire that each subspace of interest be populated by some number 
of chains all times. 



Table 2. RJMCMC applied to the images in Figure[T] 

Model A3/A2 A-s/Atotai P(2)t p(3) p(4) 



1 


1.0 


0.167 


0.0 


0.9997 


0.0003 


2 


0.5 


0.091 


0.0 


0.9997 


0.0003 


3 


0.3 


0.057 


0.444 


0.556 


0.0003 


4 


0.2 


0.038 


0.963 


0.036 


0.001 


5 


0.0 


0.0 


0.998 


0.002 


0.0 



t p{m) is the probability of states in the subspace with m compo- 
nents. For these simulations, p(l) = p(5) = p(6) = 

3.3.1 Example: a simple transition probability 

Consider two models. Model 1 with two real parameters: 9'-'^' : 
(^1,^2) G TZ^ and Model 2 with one real parameter O'-^' : 
9 £ TZ. Let us assume that the prior probability of transi- 
tion between the models is p(l,2) — p(2,l) = 1/2 and 
adopt the transformation between the subspaces: 312(6*1, 6*2) ~ 
l{9i + 6'2)/2, {9i - 92)/2] = {9, <j)). The variable (f) is distributed 
as q{(j)). The inverse transformation is: 321 {9, u) — (9 + <j),9 — (f>). 
Therefore, given 9, one draws (f> from g((jf>) and immediately obtain 
(^1, ^2). The acceptance probability for (^1, S2) — >■ 9 becomes 



021 = 



7r2(ei,e2) 



d{9,^) 



d{9^ 



VTl 



' fll+Sg. > 



ii 



61-62 



2-K2{9l,92) 



and for 61 ^ (6li,6l2): 



Q12 — 



2-K2{9 + <l,,9-4>) 



(19) 



(20) 



3.3.2 Example: BIE mixture modeling 

Many problems in astronomy are mixtures of components drawn 
from the same model family. Each component j in the mixture is 
additively combined with a weight Wj such that X^jLi ^3 ~ 1- 
This allows the predefinition of some generic RJMCMC transitions 
that are likely to work for a wide variety of problems. I consider 
two types of transitions. The first is the birth and death of a com- 
ponent. The birth step is implemented by selecting a new compo- 
nent from the prior distribution for the component parameters for 
the user-specified model. The prior for the weights is chosen here 
to be a Dirichlet distribution (eq. IB4) with a single user-specified 
shape parameter since each component is assumed to be indigu- 
ishable a priori. Assume that the new component is bom in state 
with m components. The new m 4- 1 weight is selected from the 
prior distribution for m -f 1 weights after marginalising over m 
of the them. The m weights in the current state are scaled to ac- 
commodate the choice. The death of a component is the inverse 
of the birth step defined by detailed balance. The second type of 
transitions are split and join transitions; again, these are inverses. 
For the split step, one selects a component at random and splits 
each component with an additive or multiplicative shift as follows: 
6I1 =9o+69,92 = 9o-S9,ox9^ ^9o{l + e),92 = 6»o(l - e). 
For an example, consider a toy model for a group of galax- 
ies where the light from each galaxy has a normal distribution 
with the same width. Each image has two well-separated compo- 
nents with Ai — 4000 and A2 = 1000 counts. A tertiary com- 
ponent with five different amplitudes ^3 is added along the line 
between the two, separated by a half width. With this choice, the 
secondary and tertiary components are blended and appear as a 
single peak (see Fig.[T]for details). The results of applying the re- 
versible jump transitions above to these four images and one with 
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Figure 1. Four test models with three two-dimensional Gaussian distributions of width 0. 1 in each dimension. The three components have centres at (0.2, 0.2), 
(0.9, 0.9), (0.3, 0.3). The first and second components for all models were realised with 1000 and 4000 counts respectively. The third component for Models 
1, 2, 3 and 4 have have 1000, 500, 300, and 200 counts, respectively. Components Two and Three are barely distinguishable by eye, even for Models 1 and 2 
and appear as an elongation. The "by eye" differences between Models 3, 4 and Model 5 with zero tertiary amplitude are not easily distinguished. 



an empty tertiary component is described in Table|2] The total num- 
ber of counts is Atotai = X]i=i ^i '^nd p(m) denotes the proba- 
bility of states in subspace m. The prior probability on the centre 
coordinates is uniform in the range (—0.2, 1.2). The prior prob- 
ability on each subspace with m £ [1,6] components is Poisson 
with a mean of 1; that is, I am intentionally preferring fewer com- 
ponents, but tests with Poisson means of 2 and 3 show that the re- 
sults are insensitive to this choice. The MCMC simulations use the 
tempered transitions algorithm described in i ]3. 1.21 with Tmax = 8 
and 16 logarithmically-spaced temperature levels. The Metropolis- 
Hastings transition probability is a uniform ball whose widths are 
adjusted to achieve an acceptance rate of approximately 0.15. The 
RJMCMC algorithm puts nearly all of the probability on the two- 
and three-component subspaces. The posterior distribution of com- 
ponent centres are accurately constrained to the input values in the 
correct subspace. Table |2]reveals a sharp transition between prefer- 
ring three to two components at an amplitude ratio of the tertiary to 
secondary amplitude, ^13/^2, at 0.3 (Model 3) and below. Figure 
[T] shows that this procedure reflects our expectation that RJMCMC 
should identify three separate components when one can do so by 
eye. Of course, the subtle visual asymmetries that allows us to do 
this would be obscured by noise that would be included in a realis- 
tic model. 

In summary, RJMCMC allows one to identify the number of 
components in a mixture without using Bayes factors. The simple 
set of transitions used here are likely to work well for a wide vari- 
ety of model families since they depend on the mixture nature, not 
properties of the underlying model families. Unlike Bayes factors, 
RJMCMC simulations may not be reused for model comparisons 
in light of a new model; rather, the RJMCMC simulation must be 
repeated including the new model. 



3.4 Convergence testing 

The BIE provides extensible support for convergence testing. Con- 
vergence testing has two goals: (1) determining when the simula- 
tion is sampling from the posterior distribution, and (2) determin- 
ing the number of samples necessary to represent the distribution. 
Here, I address the first goal. For mult i ple ch ains, the work horse is 
the commonly used lOelman & RubinI jl992h statistic. This method 
compares the interchain variance to the intrachain variance for an 
ensemble of chains with different initial conditions; the similarity 
of the two is a necessary condition for convergence. 



For single-chain algorithms, I have had good success with a 
diagnostic method that assesses the convergence of both marginal 
and joint posterior densities following Giakouinatos et al. (1993, 
hereafter GVDP). This method determines confidence regions for 
the posterior mean by batch sampling the chain. As the distribution 
converges, the distribution of the chain states about the mean will 
approach normality owing to the central limit theorem: the vari- 
ance as a function of l/\fN for a sample size A'^ will be linearly 
correlated for a converg ed simulation. This approach generalises 
lOelman & RubinI ( Il992h who used the coefficient of determina- 
tion (C.O.D., the sq uare of the Pears on's product-moment coiTela- 
tion coefficient, e.g. lPress et al.l 19921 § 14.5) to assess convergence. 
Moreover, GVDP use the squared ratio of the lengths of the empir- 
ical estimated confidence intervals for the parameters of interest as 
an alternative inteipretation of the R diagnostic. This alternative 
calculation of R is simpler to compute than the original ratio of 
variances and is free from the assumption of normality. 

For parallel chain applications, and especially for differential 
evolution (see i ]3.1.4| l. some chains will get stuck in regions of 
anomalously low posterior probability. Several outliers in a large 
ensemble of chains can be removed without harming the simula- 
tion as long as the chains are independent. 

3.5 Goodness-of-flt testing 

As described in ij2] model assessment is an essential component of 
inference. I have explored two approaches: Bayesian p- values and 
Bayes factors for a non-parametric model. The first is easily applied 
but is a qualitative indicator only. The second has true power as a 
hypothesis test but is computationally intensive. 



3.5.1 Posterior predictive checking 

Once one has successfully simulated the posterior distribution, one 
may predict future data points easily. The predicted distribution of 
some future data D'"^^'' after having observed the data D is 



p(D 



pred\ 



D) 



p(D^'-'='*|0,D)p(0|D)d0, 



(21) 



called the posterior predictive distribution. In the last integral ex- 
pression in equation ( I21t . p{D^^^'^\9 , D) is the probability of ob- 



© 2012 RAS, MNRAS 000,[T]tT7] 



Bayesian Inference Engine 1 1 



serving DP'"'^'^ given the model parameter 6 and observed data set 
D. The distribution p{0\D) is the posterior distribution. For many 
problems of interest, the probability of observing some new data 
given the model parameter 9 is independent of the original D. In 
many cases, p(D'"''^'*|0, D) will be the standard likelihood func- 
tion p(D'"^'^''j0), simplifying equation Ml\ . One may simulate the 
posterior predictive distribution using an existing MCMC sample 
as follows: 1) sample m values of from the posterior; 2) for each 
in a posterior set, sample a value of D'"^'^'^ from the likelihood 
p(£)predjg-j jj^g ^ values of D*""'* represent samples from the 
posterior predictive distribution p(D'"^'''*|D). 

One can attempt to check specific model assumptions with 
posterior predictive checks (PPC) using the posterior predictive dis- 
tribution. The idea is simple: if the model fits, predicted data gen- 
erated under the model should look similar to the observed data. 
That is, the discrepancy measure applied to the true data should 
not lie in the tails of the predicted distribution. If one sees some 
discrepancy, does it owe to an inappropriate model or to random 
variance? To answer this question (following Gelman 2003i, and 
references therein), one generates M data sets, D^"^*^ , ■ ■ ■ , D^^*^ 
from the posterior predictive distribution p(D'"^^''|D). Now one 
chooses some number of test statistics T(D, 0) that measure the 
discrepancy between the data and the predictive simulations. These 
discrepancy measures can depend on the data D and the parameters 
and hyperparameters 6, which is different from standard hypothe- 
sis testing where the test statistic only depends on the data, but not 
on the parameters. The discrepancy measures T(D, 0) need to be 
chosen to investigate deviations of interest implied by the nature 
of the problem at hand. This is similar to choosing a powerful test 
statistic when conducting a hypothesis test. Any chosen discrep- 
ancy measure must be meaningful and pertinent to the assumption 
you wan t to test. Examp les of this approach using the BIE may be 
found in lLu et 



St. bxamp i 
ai]l l20ia) 



3.5.2 Non-parametric tests 

This class of goodness-of-fit tests weights a parametric null hypoth- 
esis against a non-parametric alternative. For example, one may 
wish to test the accuracy of an algorithm that has produced n inde- 
pendent variates 0i:„ = {9i,92, ■ ■ ■ , On) intended to be normally 
distributed. One would test the null hypothesis that the true density 
is the normal distribution A/'(/i, a'^) against a rich non-parametric 
class of densities by placing a prior distribution on the null and al- 
ternative hypotheses and calculating the Bayes factor. This leads to 
difficult, high-dimensional calculations. 

For the BIE, I adopt a rem arkably clever method proposed by 
IVerdinelli & WassermanI 1 119981 hereafter VW) to perform a non- 
parametric test without proposing alternative models directly. The 
VW approach is based on the following observation: since the cu- 
mulative distribution function for a variable 6, F{6), is strictly in- 
creasing and continuous, the inverse F~^{u) for u G [0, 1] is the 
unique real number 6 such that F{9) = it. In the multivariate 
case, the inverse of the cumulative distribution function will not 
be unique generally, but, instead, one may define 



F-\u) 



inf {F{e) > u} 



(22) 



for a parameter vector 9 of rank d. Then, rather than defining a gen- 
eral class of densities in R** to propose the alternative, VW consider 
a functional perturbation to F, G{F{6)) say, such that G maps the 
unit interval onto itself. The identity, G{u) = u, is the unperturbed 
probability distribution. Then, the test evaluates the uniformity of 



Table 3. Marginal likelihood values for Verdinelli-WasseiTnan tests de- 
scribed in ii3.5.2l 



Class 



Model type lnP{D\M) sjj 



VW(1) 
Fiducial (2) 



Gaussian 
Gaussian 



586.7tJ;;5i 

,+0.01 
-0.01 



588.1"' 



-1.4 



+ 1.6 
0.02 



VW(1) 
Fiducial (2) 



Power law 
Power law 



588.7tt.^ 

:;+0.01 
-0.01 



114.5"' 



474.2"* 



t Following the definition in i|2.3l B12 denotes the odds that 
Model 1 is more likely than Model 2. 

the distribution of probabilities under each hypothesis. To be con- 
crete, let Fo describe the probability of sample 6 under hypothe- 
sis Hq. That is, Fo is the cumulative distribution function of the 
posterior distribution. Under Ho, one expects each random variate 
Fo{0j) for all j to be independent and identically distributed (iid) 
as a uniform distribution in the unit interval. This may be notated 
as follows: 

Ho : Fo(6>i), Fo{02), , Fo(6>„) ~ Uniform(0, 1). 
Alternatively, a poor fit will satisfy the hypothesis Hi : 

iid 

Hi : Fo(6>i),Fo(6»2),,Fo(6»„) / Unifoi-m(0, 1). 

To construct the functional perturbation G, 

IVerdinelli & WassermanI use a sequence of Legendre polyno- 
mials, {Cj(')ii = 1)2,...}, defined over the unit interval to 
construct infinite exponential densities of the form 



g{u\ip) = exp 
tp — (i/)i, tp2 ■ ■ ■) are coefficients and 



U=i 



cW 



c{Tp) — log / ditexp 



X^^-'^J^") 



U=i 

is a normalising constant. VW put priors on ^, given by tpj ~ 
A/'(0, r^/c|) where r and the Cj are appropriately chosen con- 
stants. VW also specify a hyperprior on r: r ~ J\f{0,w'^) trun- 
cated to the positive values. This distribution provides finite prob- 
ability for obtaining the null hypothesis near t — and decreases 
monotonically for larger perturbations from the null, maintaining 
the perturbative nature of the alternative hypothesis. 

Intuitively, this development is closely related to the prob- 
abilistic interpretation of the marginal likelihood and Bayes fac- 
tors. To see this, consider the one-dimensional case for simplicity: 
let /(Dje) = P(6l)P(D16l) and Fo{e) = j'^^dO f{T)\e) and 
P(D) = Fo(oo). If the distribution of Fo{9i) for {Oi} is not uni- 
form in [0, 1], one can perturb /(Dj6) by moving some density 
from a region of under sampling to a region of over sampling and, 
thereby, increase F(D). 

For an example, I apply the VW method to the following two 
models defined by a two-dimensional normal distribution and by 
a power-law-like distribution with unknown centres and widths in 
each dimension: 



POauseix, y; 6x,0y,ax,IJy) 

Ppower {x,y;6x,0y,ax,o-y,a) 



( 1 



-r''/2 



\2-Kaj:Oy 

( 1 \ a(H-Q 



(23) 



yi-Ka^ay) (H-r)2+°= 

(24) 
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where r = x /a^ + y jOy and a > 0. For the examples here, 
I adopt a = 1. I take the Gaussian model, Paausa, or power-law 
model, Ppower, to be the null hypothesis (denoted 0). For each 
model, I assume the centres are normally distributed with zero 
mean and unit variance and that the variance is WeibuU distributed 
(eq. lB2t with scale 0.03 and shape of 1. A 1000-point data set is 
sampled from a two-dimensional normal distribution centred at the 
origin with a root variance of each dimension of 0.03. 

I take the same model with the VW extension with n = 5 basis 
functions to be the alternative hypothesis (denoted 1) and test the 
support for the two hypotheses using Bayes factors. Although VW 
recommend the choice w — 1 for the hyperprior, I found that this 
tends to overfit the extended model, not favouring the model when 
it is correct. Rather I adopt w — 4 which suppresses this tendency. 
I performed four MCMC simulations with and without the VW ex- 
tension and with both the Gaussian and power-law models. Each 
used the tempered differential evolution algorithm (see i]3.1.4| ) with 
32 chains and Tmax = 32 to obtain two million converged states. 
Then, the posterior samples were batched into groups of 250,000 
states and the marginal likelihood was computed using the algo- 
rithms described in i]3.2l The 90% credible interval was computed 
by bootstrap analysis from the batches. The results are summarised 
in Table [3] One sees from the table that the true model is preferred 
to the extended VW model, but only mildly. Conversely, the power- 
law model is strongly disfavoured relative to the extended model for 
the normal data sample. Also note: the marginal likelihood value 
for the normally distributed data given the normal model (Row 1 
in the table) has almost the same value as the extended VW power- 
law model (Row 4 in the table). As pointed out by VW, the ex- 
tended model provides a estimator of the true posterior density. The 
agreement of these two values suggests that the five basis functions 
provide sufficient variation to reproduce the true value and that the 
ten-dimensional numerical evaluation of the marginal likelihood in- 
tegral using the methods described in i]3.2l is sufficiently accurate. 



4 CASE STUDIES 

4.1 Semi-analytic galaxy formation models: BIE-SAM 

Many of the physical processes parametrised in semi-analytical 
models of galaxy formation remain poorly understood and under 
specified. This has two critically important consequences for infer- 
ring constraints on the physical parameters: 1) prior assumptions 
about the size of the domain and the shape of the parameter dis- 
tribution will strongly affect any resulting inference; and 2) a veiy 
large parameter space must be fully explored to obtain an accurate 
inference. Both of these issues are naturally tackled with a Bayesian 
approach that allows one to c onstrain the theo ry with data in a prob- 
abilistically rigorous way. In lLu et al.l ( l201lh . we presented a semi- 
analytic model (SAM) of galaxy formation in the framework of 
Bayesian inference and illustrated its performance on a test prob- 
lem using the BIE; we call the combined approach BIE-SAM. Our 
sixteen-parameter semi-analytic model incorporates all of the most 
commonly used parametrizations of important physical processes 
from existing S AMs including star formation, SN feedback, galaxy 
mergers, and AGN feedback. 

To demonstrate the power of this approach, the thirteen of 
these parameters that can be co nstrained by the K-band luminos- 
ity function were investigated in lLu et al.l ( 120 llh . We find that the 
posterior distribution has a very complex structure and topology, 
indicating that finding the best fit by tweaking model parameters 
is improbable. As an example. Figure |2] describes isosurfaces of 



the posterior distribution in three of thirteen dimensions. The sur- 
faces have a complex geometry and are strongly inhomogeneous in 
any parameter direction. Moreover, the posterior clearly shows that 
many model parameters are strongly covariant and, therefore, the 
inferred value of a particular parameter can be significantly affected 
by the priors used for the other parameters. As a consequence, one 
may not tune a small subset of model parameters while keeping 
other parameters fixed and expect a valid result. 

Apropos the discussion in i]2.4| by using synthetic data to 
mimic systematic uncertainties in the reduced data, we also have 
shown that the resulting model parameter inferences can be signif- 
icantly affected by the use of an incorrect error model. We used a 
synthetically-generated binned stellar mass function and performed 
two inferences: one with a realistic covariance and one with no 
off-diagonal covariance. The contours with the full covariance ma- 
trix are more compact, but there are also noticeable changes in 
the shape and orientation of the posterior distribution. This clearly 
demonstrates that an accurate analysis of errors, both sampling er- 
rors and systematic uncertainties, are crucial for observational data, 
and conversely, a data-model comparison without an accurate error 
model is likely to be erroneous. 

The method developed here can be straightforwardly applied 
to other data sets and to multiple data sets simultaneously. In ad- 
dition, the Bayesian approach explicitly builds on previous re- 
sults by incoiporating the constraints from previous inferences into 
new data sets; the BIE is designed to do this automatically. For 
many processes in galaxy formation, competing models have been 
proposed but not quantitatively compared. Bayes factor analyses 
(see i]3.2l l or explicit mo del comparis on techniques such as the re- 
versible jump algorithm ( lGreenlll995L see also i]3.3t can provide a 
quantitative comparison of different models for a given data set. 



4.2 Galphat 

lYoon et alj j201lh describes Galphat (GALaxy PHotometric AT- 
tributes), a Bayesian galaxy image analysis package built for the 
BIE, designed to efficiently and reliably generate the posterior 
probability distribution of model parameters given an image. From 
the BIE point of view, both Galphat and the BIE-SAM are like- 
lihood functions. The Galphat likelihood function is designed to 
produce high-accuracy photometric predictions for galaxy models 
in an arbitrary spatial orientation for a given point-spread function 
(PSF). Accurate predictions in both the core and wings of the im- 
age are essential for reliable inferences. The pixel predictions are 
computed using an adaptive quadrature algorithm to achieve a pre- 
defined eiTor tolerance. The rotation and PSF convolution are per- 
formed on sub-sampled grids using an FFT algorithm. Galphat can 
incorporate any desired galaxy image family. For speed, we pre- 
compute subsampled grids for each model indexed by parameters 
that influence their shape. The desired amplitude and linear scale 
are easily computed through coordinate transformations. To enable 
this, the images are stored as two-dimensional cumulative distri- 
butions. Our cuiTent implementation uses Sersic (1963) models for 
disk, bulge and spheroid components, however, this approach is di- 
rectly applicable to any model family. 

Using the various tempering algorithms available in the BIE, 
our tests have demonstrated that we can achieve a steady-state dis- 
tribution and that the simulated posterior will include any multi- 
ple modes consistent with the prior distribution. Given the pos- 
terior distribution, we may then consistently estimate the credi- 
ble regions for the model parameters. We show that the surface- 
brightness model will often have correlated parameters and, there- 
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Figure 2. The likelihood function for 3 out of the 13 free parameters in the BIE-SAM from lLu et alj 1201 ih : the star-formation threshold surface density fsF, 
the star-formation efficie ncy power-law i ndex agp, and the supernova feedback energy fraction cegpf. The blue (red) surfaces enclose approximately 10% 
(67%) of the density. See lLu et al] )201lh for additional details. 
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Figure 3. The size-Sersic index relation inferred from a synthetically-generated sample of elliptical galaxy images Left: a scatter plot using the best-fit 
parameters. The red curve shows a smooth fit to the ridge-hne of the points. Right: the marginal posterior density for the same parameters. While the left- 
hand plot suggests that small galaxies have low concentrati ons, the right-hand plot of posterior density correctly reveals that this trend is an artifact of the 
model-fitting procedure. These figures originally appeared in lYoon et alj J201 ih . 



fore, any hypothesis testing that uses the ensemble of posterior in- 
formation will be affected by these coiTelations. The full posterior 
distributions from Galphat identify these correlations and incorpo- 
rate them in subsequent inferences. 

These issues are illustrated in Figures |3] and |4] Figure |3] 
shows the size-Sersic index relation inferred from a synthetically- 
generated sample of elliptical galaxy images. The left-hand panel 
shows the traditional scatter diagram of maximum posterior param- 



eter values; that is, this figure plots the effective radius and Sersic 
index for the maximum likelihood value obtained for each image 
fit. The red curve is a smooth estimate of the trend. The right-hand 
panel shows the inferred distribution based on the full posterior dis- 
tribution of the ensemble after marginalising over all of the param- 
eters but the effective radius and the Sersic index. The left-hand 
panel incorrectly suggests that smaller galaxies are less concen- 
trated while the right-hand panel correctly reveals that the size and 
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Figure 4. Marginal posterior densities for 100 galaxies with a signal-to-noise ratio of 20 and randomly sampled Sersic model parame- 
ters. The values are magnitude difference from the input values (AMAG), galaxy half-Ught radius scaled by the input values (r*/re), 
the Sersic index difference (An) and the sky value difference (Asky in percent). The blue, red and grey curves and contours show 
galaxies with n > 2, n < 2 and the total sample, respectively. The parameter covariance depends on both the signal-to-noise ratio and 
the Sersic index. 



concentration are uncorrelated. Figure 2]illustrates the correlation 
between parameters for low-concentration galaxies (Sersic index 
n < 2) in blue and high-concentration galaxies (?i > 2) in red with 
the total in grey scale. 

We can use posterior simulations over ensembles of images 
to test, for example, the significance of cluster and field environ- 
ments on galaxies as evidenced in their photometric parameters, 
such as the correlation between bulge-to-disk ratio and environ- 
ment. A more elaborate example might include models for higher 
angular harmonics of the light distribution and we could determine 
the support for these in the data using Bayes factors ( ii3.2t . 



5 DISCUSSION AND SUMMARY 

Advances in digital data acquisition along with storage and retrieval 
technologies have revolutionised astronomy. The promise of com- 
bining these vast archives have led to organisation frameworks such 
as the National Virtual Observatory (NVO) and the International 
Virtual Observatory Alliance (IVOA). To realise the promise of this 
vast distributed repository, the modem astronomer needs and wants 
to combine multi-sensor data from various surveys to help con- 
strain the complex processes that govern the Universe. The neces- 



sary tools are still lacking, and the Bayesian Inference Engine (BIE) 
was designed as a research tool to fill this gap. This paper outlines 
the motivation, goals, architecture, and use of the BIE and reports 
our experience in applying MCMC methods to observational and 
theoretical Bayesian inference problems in astronomy. 

Most researchers are well-versed in the identifying "best" pa- 
rameters for a particular model for some data using the maxi- 
mum likelihood method. For example, consider the fit of a surface 
brightness model to galaxy images. Parameters from the maximum- 
likelihood solutions are typically plotted in a scatter diagram and 
trends are interpreted physically. However, plotting scatter dia- 
grams from multiple data sources inadvertently mixes error models 
and selection effects. i ]2. II described the pitfalls of this approach. 
Rather, the astronomer wants to test the hypothesis that the data 
is correlated with a coefficient larger than some predetermined 
value a, a complex hypothesis test. However, without incorporat- 
ing the coiTelations imposed by both the theoretical model, the er- 
ror model, and the selection process, the significance of the test 
is uncertain. Similarly, the astronomer needs methods of assess- 
ing whether a posited model is correct. I have divided these needs 
into two categories: goodness-of-fit tests ( i]2.2b and model selec- 
tion ( i]2.3b . As an example of the former, the astronomer may have 
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found the best parameters using maximum likelihood, but does the 
model fully explain all of the features in the data? If it does not, 
one must either modify or reject the model before moving on to the 
next step. As an example of the latter, suppose an inference results 
in two parameter regions or multiple models that explain the data. 
Which model best explains the data? 

All of these wants and needs — combining data from multi- 
ple sources, estimating the probability of model parameters, assess- 
ing goodness of fit, and selecting between competing models — are 
naturally addressed in a single probabilistic framework known as 
Bayesian inference. In particular, Bayesian inference provides a 
data-first discipline that demands that the eiTor model and selec- 
tion effects are specified by the probability distribution for the data 
given the model A4, P{T)\9,A4), colloquially known as the like- 
lihood function L('D\9,A4). Prior results including quantified ex- 
pert opinion are specified in the prior probability function ■k{9\M). 
The inferential computation may be incremental: the data may be 
added in steps and new or additional observations may be moti- 
vated at each step, true to the scientific method. In the end, this 
approach may be generalised to locating the most likely models in 
the generalised space of models; this leads to goodness-of-fit and 
model comparison tests. 

For scientists, the ideal statistical inference is one that lets the 
data "speak for themselves." This is achievable in some cases. For 
example, estimating a small number of parameters given a large 
data set tends to be independent of prior assumptions. On the other 
hand, hypothesis tests of two complex models may depend sensi- 
tively on prior assumptions. For a trivial example, some choices 
of parameters may be unphysical even though they yield good fits 
and should be excluded from consideration. Moreover, if two com- 
peting models fit the data equally well, any hypothesis test will 
be dominated by prior information. Inferences based on realistic 
models of astronomical systems will often lie between these two 
extremes. For these cases, I advocate Bayesian methods because 
they precisely quantify both the scientists' prior knowledge and the 
information gained through observation. In other words, we allow 
the data to "speak for themselves" but in a "dialect" of our choos- 
ing. Philosophy aside, Bayes theorem simply embodies the law of 
conditional probability and provides a rigorous framework for com- 
bining the prior and derived information. 

With these advantages comes a major disadvantage: Bayesian 
inference is computationally expensive! An inference may require 
a huge sample from the posterior distribution and real- world com- 
putation of P(D|0, A1) is often costly. Moreover, naive MCMC 
algorithms used for sampling the posterior distribution converge 
unacceptably slowly for distributions with multiple modes, and 
advanced techniques to improve the mixing between modes are 
needed, increasing the expense. Finally, Bayesian inference re- 
quires integrals over typically high-dimensional parameter spaces. 
For example, Bayes factors ( i]2.3t require the computation of the 
marginal likelihood: 

P{D\M)= f den{0\M)P{'D\e,M). 

Evaluation of this integral suffers from the curse of dimensionality. 
Nonetheless, the elegance and promise of Bayesian inference 
motivated us to attempt a computational solution and this became 
the BIE project. The algorithms and techniques described here, all 
and more available in the BIE, have proved useful to address the 
complications found in research problems. In short, the BIE fills 
a gap between tools developed for small-scale problems or those 
designed to test new algorithms and a computational platform de- 
signed for production-scale inference problems typical of present- 



day astronomical survey science. Its primary product is a sample 
from a posterior distribution to be used for parameter estimation 
and model selection. Other Bayesian applications, such as non- 
parametric inference and clustering, should be possible with little 
modification, but have not yet been investigated. The BIE is de- 
signed to ran on high-performance computing clusters, although it 
will also run on workstations and laptops. 

The open object-oriented architecture allows for cross- 
fertilisation between researchers and groups with both mathemat- 
ical and scientific interests, e.g. both those developing new algo- 
rithms and those developing new astronomical models for differ- 
ent applications. New classes contributed by one become available 
to all users after an upgrade. Approaches implemented by the one 
user's new classes may solve an unanticipated set of problems for 
other users. In this way, the BIE is a distributed collaborative sys- 
tem similar to packages like IDL or modules in Python. I anticipate 
that users with a variety of technical skill levels will use the BIE. By 
reusing and modifying supplied examples, a user's model of a like- 
lihood function can be straightforwardly added to the system with- 
out any detailed knowledge of the internals. A user's new model 
becomes an internally-documented first-class object within the BIE 
by following the examples as templates. There is also room for the 
experienced programmer to improve the low-level parallelism or 
implement more efficient heuristics for likelihood evaluations. The 
BIE includes a full persistence subsystem to save the state and data 
for running a MCMC simulation. This facilitates both checkpoint- 
ing and recovery as well as later use of inferred posterior distribu- 
tions in new and unforeseen ways. A future version will implement 
a built-in database for warehousing results including the origin and 
history of both the data and computation, along with labels, notes 
and comments. Altogether, this will constitute an electronic note- 
book for Bayesian inference. 

This paper provides some concrete examples of parameter es- 
timation and model comparison using the BIE. These do not ex- 
haust the full BIE feature set but, rather, provide a quick intro- 
duction to illustrate and motivate its use. The BIE provides the 
astronomer an organisational and computational schema that dis- 
criminates between models or hypotheses and suggests the best use 
of scarce observational resources. In short, if we can make better 
use of the interdependency of our observations given our hypothe- 
ses, then we can generate a far clearer picture of the underlying 
physical mechanisms. In many ways, the Bayesian approach em- 
ulates the empirical process: begin with a scientific belief or ex- 
pectation, add the observed data and then modify the extent of that 
belief to generate the next expectation. In effect, as more and more 
data is added to the model, the accurate predictions become rein- 
forced and the inaccurate ones rejected. This approach relieves the 
scientist from directly confronting the complex interdependencies 
within the data since those interdependencies are automatically in- 
coiporated into the model. Although our scientific motivations are 
astronomical, the BIE can be applied to many different complex 
systems and may even find applications in areas as diverse as bio- 
logical systems, climate change, and finance. 
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APPENDIX A: BIE: TECHNICAL OVERVIEW 

The BIE is a general-purpose system for simulating Bayesian 
posterior distributions for statistical inference and has been 
stress tested using high-dimensional models. As described in 
the previous sections, the inference approach uses the Bayesian 
framework enabled by Monte Carlo Markov chain (MCMC) 
techniques. The software is parallel and multi-threaded and 
should run in any environment that supports POSIX threads and 
the widely-implemented Message Passing Interface (MPI, see 
http://www.mpi-forum.org). The package is written in 
C-H- and developed on the GNU/Linux platform but it should port 
to any GNU platform. 

BIE at its core is a software library of interoperable compo- 
nents necessary for performing Bayesian computation. The BIE 
classes are available as both C-l~l- libraries and as a stand-alone sys- 
tem with an integrated command-line interface. The command-line 
interface is well tested and is favoured by most users so far. A user 
does not need to be an expert or even an MPI programmer to use the 
system; the simple user interface is similar to MatLab or Gnuplot. 
In addition to the engine itself, the BIE package includes a num- 
ber of stand-alone programs for viewing and analysing output from 
the BIE, testing the convergence of a simulation, manipulating the 
simulation output, and computing the marginal likelihood using the 
algorithms described in ii3.2| A future release will include wrappers 
for Python. 

Al Software architecture 

As in GIMP Toolkit (GTK-I-) and the Visualisation Toolkit (VTK), 
the BIE uses C-l~l- to facilitate implementing an object-oriented de- 
sign. Object-oriented programming enforces an intimate relation- 
ship between the data and procedures: the procedures are responsi- 
ble for manipulating the data to reflect its behaviour. The software 
objects (classes in C++) represent real-world probability distribu- 
tions, mathematical operators and algorithms, and this presents a 
natural interface and set of interobject relationships to the user and 
the developer. Programs that want to manipulate an object have to 
be concerned only about which messages this object understands, 
and do not have to worry about how these tasks are achieved nor 
the intemal stmcture of the object. 

Another powerful feature of object-oriented programming is 
inheritance (derived classes in C-l~l-). The derived class inherits the 
properties of its base class and also adds its own data and routines. 
This stmcture makes it is easy to make minor changes in the data 
representation or the procedures. Changes inside a class do not af- 
fect any other part of a program, since the only public interface 
that the external world has to a class is through the use of methods. 
This facilitates adding new features or responding to changing op- 
erating environments by introducing a few new objects and mod- 
ifying some existing ones. These features encourage extensibility 
through the reuse of commonly used structures and innovation by 
allowing the user to connect components in new and possibly un- 
foreseen ways. In addition, this facilitates combining concurrently 
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developed software contributions from scientists interested in spe- 
cific models or data types and MCMC algorithms. 

Motivated by handling large amounts of survey data with the 
subsequent possible need to investigate the appropriate simulation 
for a variety of models or hypotheses, the BIE separates the com- 
putation into a collection of subsystems: 

(i) Data input and output 

(ii) Data distribution and spatial location 

(iii) Markov chain simulation 

(iv) Likelihood computation 

(v) Model and hypothesis definition 

Each of these can be specified independently and easily mixed and 
matched in our object-oriented architecture. The cooperative de- 
velopment enabled by this architecture is similar to that behind the 
open-source movement, and this project is a testament to its success 
in an interdisciplinary scientific collaboration. The BIE use SVN 
version management (autoconf, automake), GNU coding standards, 
and DejaGNU regression testing to aid in portability. Moreover, I 
believe that this same social model will extend to remote collabo- 
rative efforts as the BIE project matures. 



APPENDIX B: DEFINITION OF DISTRIBUTIONS 

Bl Normal distribution 

The normal (or Gaussian) distribution is the familiar bell-shaped 
density function that results from the central limit theorem: 



M{e-^i,o^)^ 



gV2^ 



(Bl) 



where parameter /i is the mean and a is the variance. See any 
standard reference in probability and statistics for the multivariate 
generalisation. 

B2 Weibull distribution 

The Weibull distribution has a scale parameter (A) and shape pa- 
rameter K. It includes the exponential distribution for k = 1 but 
becomes more peaked around around A as k increases. The proba- 
bility density function of a Weibull random variable 9 is: 

^ ^ |o e<Q. 



A2 Persistence system 

The researcher needs to be able to stop, restart, and possi- 
bly refocus inferential computations for both technical and sci- 
entific reasons. The BIE was designed with these scenarios 
in mind. The BIE's persistence system is built on top of the 
BOOST (|http : //www. boost .o rg^ serialisation library. The 
BIE classes inherit from a base serialisation class that provides the 
key serialisation members and a simple mnemonic scheme to mark 
persistent data in newly developed classes. The serialisation-based 
persistence system avoids irrevocably modifying data sets or files. 
Rather it views the computation as a series of functions, each ac- 
cepting one or more input data sets and producing one or more new 
output data sets. The system records and timestamps these com- 
putations and the relationships between inputs and outputs in an 
archive research log, so that one can always go back and determine 
the origin of data and how it was processed. The most common use 
of BIE persistence to date is checkpointing and recovery, Check- 
pointing guards against loss of computation by saving intermediate 
data to support recovery in the middle of long-running computa- 
tional steps; and it allows one to "freeze" or "shelve" a computa- 
tion and pick it up later. It also provides the basic support needed 
to interrupt a computation, do some reconfiguring, and resume, as 
when machines need to be added to or removed from a cluster, etc. 



B3 Beta distribution 

The beta distribution is a two-parameter density defined on the unit 
interval. The probability density function of the beta distribution is: 

Ps{x-c.,p)^^^f^±§-x-\l-xY-' (B3) 

r(a)r(/3) 

where r(-) is the gamma function. 

B4 Dirichlet distribution 

The Dirichlet distribution is the multivariate r-dimensional gener- 
alization of the beta distribution with parameters a; , i = 1, . . . , r. 
Thinking of each dimension as a separate type of event i, this distri- 
bution describes the probability of events where each type of event 
has been previously observed a^ — 1 times. The Dirichlet distribu- 
tion of order r > 2 with parameters a; has the probability density 
function: 



Pt>{xi, 
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(B4) 



where Xi, . . . , a::r-i > and x^ = 1 — X]i=i •''» '^ ^■ 



A3 Extensibility 

BIE is designed to be extensible. The user may define new classes 
for any aspect of the MCMC simulation, such as MCMC algo- 
rithms, convergence tests, prior distributions, data types, and likeli- 
hood functions. The code tree includes a Projects directory that 
is automatically compiled into any local build that may contain any 
locally added functionality. For example, both the BIE-SAM and 
Galphat were derived from a base class specifically for user- defined 
likelihood functions. 

The source tree is available for download from 
|http : //www, astro .umass . edu/bie| The package 
includes Debian and Ubuntu package management scripts so 
that local . deb packages may be built. Users have had success 
building other modem Linux distributions. 
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